Background

  • Computing is a tool to achieve an objective, not an end in itself
    • Based on experience of PhD in Quantum Physics

Problem

  • Using machine learning to understand brain function
    • Functional MRI; records brain activity over time and space
  • Learn link between brain activity and cognitive function
    • Feature-engineer the input so that it maps accurately to brain function.
    • This might inform how the brain is actually doing the mapping.

Prior Art

  • Visual image reconstruction from brain activity, Miyawaki et al., 2008
    • Very impressive, but not reproducible.
    • Science needs to be reproducible.
  • Make it work, make it right, make it boring.
  • Want to robustly and reliably reproduce scientific results, make them boring

What is a good theory

  1. Accurately describe a large class of observations (training)
  2. Make definite predictions about future observations (testing)
  • This is machine learning!
  • Not just minimising error; data-driven science uses data to derive better models, not just to build classifiers.

Problems with software development in labs

  • Labs are like startups
    • Recruiting talent, keeping them.
    • Limited resources.
    • "Bus factor". How many people can be hit by a bus before your project stops.
  • You really need to engineer software well to survive yourself moving on for whatever reason.
  • Technical debt
    • You need to do the maintenance, documentation, testing, or else your project will inevitably die off.

Patterns in data processing

  1. Interact with data manually.
  2. Automate the interaction.
  3. Go to 1.
  • Iteration goes with consolidation.
    • As you iterate you reduce technical debt and get closer to the goal.
  • Academia is moving from statistics to statistical learning (formal machine learning)
    • Mainly due to dimensionality of feature sets.
  • From parameter inference to prediction.

Design philosophy

  1. Don't solve hard problems; bend the original problem.
    • Judo technique.
  2. Easy setup.
    • Think about installation steps, dependencies, convention over configuration.
  3. Fail gracefully.
    • Robust
    • Easy to debug (major, key success point of Python. Much easier to debug than C).
  4. Quality.
  5. Don't invent a kitchen sink.
    • Keep the focus as narrow as possible.
    • This increases the bus factor.
    • As you need features, create new projects and link them.

scikit-learn

  • Presenter is a core contributor.
  • Vision: machine learning without knowing the math.
    • A black box, but one that can be opened.
  • Apple vs Linux.
    • Older geeks tend to use Apple products. Things should just work.
  • This module can't magically solve feature engineering for you.
    • But Python is the perfect language to solve this by yourself.
  • Sticking to high-level programming keeps scikit-learn alive.
    • But how do you stay performant at this high level?
    • Optimise algorithms, not low-level stuff.
    • Know NumPy and SciPy perfectly.
    • All data must be arrays/memoryviews. Avoid memory copies, defer to BLAS/LAPACK (illustrated after this list).
    • Cython.
    • scikit-learn actively avoids C/C++.
      • Increases bus factor.
      • New contributors always complain, but this philosophy works.
  • http://scipy-lectures.github.io
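
  A minimal illustration of the array point above (not scikit-learn code): keep data in NumPy arrays, avoid copies with in-place operations, and let the heavy lifting defer to BLAS through NumPy. Shapes are arbitrary.

    import numpy as np

    X = np.random.rand(1000, 200)
    W = np.random.rand(200, 50)

    # BLAS-backed matrix multiply: no Python-level loop, and no extra copies
    # beyond the output buffer
    Y = X @ W

    # in-place scaling avoids allocating a second array the size of X
    X *= 2.0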

Hierarchical clustering

  • Pull request 2199
  • How (sketched after this list):
    1. Take the two closest clusters.
    2. Merge them.
    3. Update the distance matrix.
  • First approach:
    • How to find the minimum? Heaps!
    • Sparse growable structures? Skip lists in Cython!
  • Second approach:
    • A C++ map[int, float] is exactly what's needed, so wrap it in Cython!
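
  A minimal, pure-Python sketch of the merge loop above, using heapq for the "closest pair" step and plain average linkage on a dense distance matrix. Purely illustrative; the pull request itself relies on Cython data structures (skip lists, a wrapped C++ map), not this naive version.

    import heapq
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def naive_average_linkage(X):
        n = X.shape[0]
        d = squareform(pdist(X))             # dense pairwise distance matrix
        active = {i: [i] for i in range(n)}  # cluster id -> member point indices
        heap = [(d[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
        heapq.heapify(heap)
        merges, next_id = [], n
        while len(active) > 1:
            dij, i, j = heapq.heappop(heap)  # 1. take the two closest clusters
            if i not in active or j not in active:
                continue                     # stale pair: one side already merged
            members = active.pop(i) + active.pop(j)   # 2. merge them
            merges.append((i, j, dij))
            # 3. update distances from the new cluster to every remaining one
            for k, mem_k in active.items():
                dk = np.mean([d[a, b] for a in members for b in mem_k])
                heapq.heappush(heap, (dk, k, next_id))
            active[next_id] = members
            next_id += 1
        return merges

    merges = naive_average_linkage(np.random.rand(10, 2))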

Data vs operations

  • Conceptually, have a big blob of data and operations are agents that walk over the data.
  • Want an imperative-like language.
    • Declarative programming is great in theory but doesn't work in practice.
  • Core grammar
    • fit, predict, transform, score, partial_fit
  • Grammar instantiated without data.
  • Build pipelines around the grammar without data (see the sketch after this list).
    • Configuration/run pattern, a la traits, pyre.
    • This is just a convention, very light. You can ignore it if you want, but if you submit a pull request ignoring it you'll get rejected.
  • a la currying in functional programming.
  • a la MVC pattern.
  • APIs are important and are informed by prior art and heuristics, despite how simple they seem.
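
  A minimal sketch of the configure-then-run pattern the grammar implies: estimators are instantiated with parameters only, composed into a pipeline with no data involved, and the data only flows in at fit/score time. The estimators and toy data below are arbitrary choices for illustration.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # configuration step: no data anywhere
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=1.0)),
    ])

    # run step: data flows through the grammar (fit, then score)
    X = np.random.rand(100, 5)
    y = (X[:, 0] > 0.5).astype(int)
    pipe.fit(X, y)
    print(pipe.score(X, y))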

Big data on small hardware

  • Can't afford Hadoop, and want to use Python end-to-end.
  • Off the shelf commodity hardware (laptops!)
  • One trick: online algorithms (see the sketch after this list)
    • Compute something one element at a time.
    • e.g. mean of a gazillion numbers? Just keep a running mean.
    • Use algorithms that statistically converge to the true value with some estimable error.
  • e.g. K-Means clustering.
    • scipy.cluster.vq.kmeans is precise, slow
    • sklearn.cluster.MiniBatchKMeans is statistical, much faster.
  • People complain "I need a cluster to add petabytes of arrays"
    • Why?? Use online algorithms.
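
  A minimal sketch of the online idea: a running mean computed one chunk at a time, and MiniBatchKMeans consuming the same stream via partial_fit. The stream generator and its sizes are made up for illustration.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def stream_chunks(n_chunks=100, chunk_size=1000, n_features=10, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_chunks):
            yield rng.rand(chunk_size, n_features)  # stand-in for data read from disk

    total, count = 0.0, 0
    km = MiniBatchKMeans(n_clusters=5, random_state=0)
    for chunk in stream_chunks():               # never holds more than one chunk
        total += chunk.sum()
        count += chunk.size
        km.partial_fit(chunk)                   # online K-Means update

    print("running mean:", total / count)
    print("cluster centers:", km.cluster_centers_.shape)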

Data reductions

  • Remember memory is hierarchical. Reducing data sets allows more to fit in higher levels of the hierarchy.
  • Take a random subset (reductions sketched after this list)
    • Random projection: sklearn.random_projection (random combinations of features)
    • e.g. Randomized SVD, sklearn.utils.extmath.randomized_svd
      • Their randomized solution is more accurate than other supposedly precise solutions
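
  A minimal sketch of the two reductions above: project the features down with a random projection, and compute a truncated SVD with the randomized solver. The array shapes and numbers of components are arbitrary.

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.utils.extmath import randomized_svd

    X = np.random.rand(2000, 500)

    # random linear combinations of the 500 features, down to 50
    proj = GaussianRandomProjection(n_components=50, random_state=0)
    X_small = proj.fit_transform(X)

    # truncated SVD via random projections
    U, S, Vt = randomized_svd(X, n_components=10, random_state=0)
    print(X_small.shape, U.shape, S.shape, Vt.shape)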

Their box

  • 48 cores, 384GB RAM, 70T storage (SSD cache on RAID controller)
  • Faster than an 800 CPU cluster!
  • Do you really need a cluster? Think about data access patterns.

Parallel processing

  • Only want to care about embarrassingly parallel problems
  • Data access / memory bus is going to be the bottleneck.
  • joblib (see the sketch after this list).
    • OpenMP-style. Why joblib rather than e.g. IPython, multiprocessing, celery?
    • No dependencies.
    • Better tracebacks.
    • Automatic mmap'ing of big arrays, no copies.
    • Lazy dispatching, important for big jobs.
    • With random forests, near-100% multi-core CPU utilisation with low memory usage.
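
  A minimal sketch of the OpenMP-style joblib API referred to above: the parallelism is declared inline and joblib manages the worker processes. The toy workload is arbitrary.

    from math import sqrt
    from joblib import Parallel, delayed

    # run sqrt on 10 inputs across 2 worker processes
    results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
    print(results)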

Need caching

  • joblib.Memory, memoize pattern (see the sketch after this list).
  • Stores very large results on disk, only returning them when you fetch them, and even then only as you iterate over them.
  • You must write functions; the author will never make a context-manager version.
  • How to hash input arguments for the memoize decorator?
    • hashlib.md5, robust, no dependencies.
    • Subclass the pickler, which is a state machine that walks the object graph.
    • If the walk finds something like an ndarray, don't turn it into a string; just pass a pointer. Avoid copies, use memoryviews.
  • When persisting objects, again subclass the pickler and e.g. np.save big NumPy arrays.
  • How to handle locking of persisting results of the cache?
    • Rely on renaming directories being atomic, a basic POSIX operation.
  • Should I compress data to/from disk?
    • Single core: faster uncompressed
    • Multi-core: zlib.compress is faster; again chosen because it has no dependencies.
    • But use it in an online way (sketched after this list).
    • Copyless compression: store meta-data too.
  • Challenges
    • How to stream large results in a cluster.
    • Too many files are slow; opening files is slow on a cluster.
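
  A minimal sketch of the memoize pattern with joblib.Memory: wrap an expensive function, and identical calls are served from the on-disk cache (arguments, ndarrays included, are hashed as described above). The cache directory and the wrapped function are made up for illustration.

    import numpy as np
    from joblib import Memory

    memory = Memory("./joblib_cache", verbose=0)   # cache directory is illustrative

    @memory.cache
    def expensive_transform(x):
        # stand-in for a long computation
        return np.vander(x, 50)

    x = np.random.rand(1000)
    a = expensive_transform(x)   # computed and persisted to disk
    b = expensive_transform(x)   # loaded from the on-disk cache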
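
  And a minimal sketch of compressing in an online way, as mentioned above: feed the array's raw bytes through zlib chunk by chunk with a compressobj, rather than one huge compress() call, so no second full-size copy is needed. The chunk size and compression level are arbitrary.

    import zlib
    import numpy as np

    arr = np.random.rand(1_000_000)
    view = memoryview(arr).cast("B")        # raw bytes of the array, no copy
    comp = zlib.compressobj(level=1)
    pieces = []
    chunk = 1 << 20                         # 1 MiB at a time
    for start in range(0, len(view), chunk):
        pieces.append(comp.compress(view[start:start + chunk]))
    pieces.append(comp.flush())
    compressed = b"".join(pieces)
    print(len(compressed), "bytes, from", arr.nbytes)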

Bigger picture - how to make a sustainable project

  • 200 contributors, ~12 core contributors.
  • The huge feature set is due to this size of team.
  • The Random Forests implementation is getting faster by orders of magnitude because of community contributions.
  1. Focus on quality.
  2. Build great docs and examples.
  • scikit-learn has a very large number of contributors making a large proportional number of commits.
    • Unlike many other Python modules.

Tragedy of the Commons

  • SciPy, NumPy - everyone uses them, but not enough people contribute.
  • These core, vital projects don't get funding.

Heuristics

  1. Set goals correctly. 80/20 rule. Focus on core goals.
  2. Use the simplest technology available. This requires great sophistication.
  3. Don't forget - real humans use your package.

Questions

  • How to encourage contributors?
    • Avoid GUIs with a passion. Don't do it.
    • Just focus on getting users, the rest follows.
    • Don't dumb down the problem.
